We decided to analyze Sports data in this project, specifically data from Formula 1 races. We have included a brief overview of the data as well as a data dictionary in the following sections.
Following this are sections that use various methods of regression and classification to answer research questions about the data. The methods included are Multiple Linear Regression, Ridge Regression, LOESS Regression, k-Nearest Neighbors, Naive Bayes, and Logistic Regression.
These methods were split equally between the partners. Nandini Bhelke worked on the first three regression methods, while Kevin Hallissey worked on the last three classification methods.
The data for this project was sourced from OpenF1 (https://openf1.org/). OpenF1 is a free and open-source API that provides real-time and historical Formula 1 data. The data can be accessed in either JSON or CSV format through the web browser. The appropriate CSV files were sourced from this website and processed into a single, complete dataset for this project. The specific files used were:
These separate CSV files were then processed and combined to form our final dataset. A data dictionary with all variables is provided to the right. The complete dataset can also be accessed at this link: https://drive.google.com/file/d/1Bvc_8Os35966WgIHAyXb80IT2ddCGT2A/view?usp=drive_link
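The assembly step described above can be sketched as follows. The project's processing was done outside this document (the analysis code is in R), so this is only an illustrative Python/pandas sketch on made-up miniature tables — the real OpenF1 extracts, their filenames, and their full column sets are not reproduced here.

```python
import pandas as pd

# Hypothetical miniature versions of the OpenF1 CSV extracts; the real
# files contain many more rows and columns.
sessions = pd.DataFrame({
    "session_key": [101, 102],
    "meeting_key": [1, 1],
    "session_name": ["Practice 1", "Race"],
})
laps = pd.DataFrame({
    "session_key": [101, 101, 102, 102],
    "driver_number": [44, 44, 44, 44],
    "lap_duration": [92.1, 91.8, 90.5, 90.9],
})

# Average each driver's lap times within a session, then attach the
# session metadata -- mirroring how the avg_* columns were built.
avg_laps = (laps.groupby(["session_key", "driver_number"], as_index=False)
                .agg(avg_duration=("lap_duration", "mean")))
combined = avg_laps.merge(sessions, on="session_key", how="left")
```

The same per-session aggregation, repeated for sector times and speeds (and restricted to the first 5 laps for the `_start` columns), yields one row per driver per session.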
| Variable Name | Description |
|---|---|
| year | The year the event takes place |
| meeting_key | The unique identifier for the meeting |
| meeting_name | The name of the meeting |
| meeting_official_name | The official name of the meeting |
| session_key | The unique identifier for the session |
| session_name | The name of the session (Practice 1, Qualifying, Race, …) |
| country_key | The unique identifier for the country where the event takes place |
| country_name | The full name of the country where the event takes place |
| driver_number | The unique number assigned to an F1 driver |
| first_name | The driver’s first name |
| last_name | The driver’s last name |
| name_acronym | Three-letter acronym of the driver’s name |
| team_name | Name of the driver’s team |
| avg_duration | The average of the total time taken, in seconds, to complete the entire lap |
| avg_duration_sec1 | The average of the time taken, in seconds, to complete the first sector of the lap |
| avg_duration_sec2 | The average of the time taken, in seconds, to complete the second sector of the lap |
| avg_duration_sec3 | The average of the time taken, in seconds, to complete the third sector of the lap |
| avg_i1_speed | The average of the speed of the car, in km/h, at the first intermediate point on the track |
| avg_i2_speed | The average of the speed of the car, in km/h, at the second intermediate point on the track |
| avg_st_speed | The average speed of the car, in km/h, at the speed trap, which is a specific point on the track where the highest speeds are usually recorded |
| avg_duration_start | The average of the total time taken, in seconds, to complete the entire lap for only the first 5 laps |
| avg_duration_sec1_start | The average of the time taken, in seconds, to complete the first sector of the lap for only the first 5 laps |
| avg_duration_sec2_start | The average of the time taken, in seconds, to complete the second sector of the lap for only the first 5 laps |
| avg_duration_sec3_start | The average of the time taken, in seconds, to complete the third sector of the lap for only the first 5 laps |
| avg_i1_speed_start | The average of the speed of the car, in km/h, at the first intermediate point on the track for only the first 5 laps |
| avg_i2_speed_start | The average of the speed of the car, in km/h, at the second intermediate point on the track for only the first 5 laps |
| avg_st_speed_start | The average speed of the car, in km/h, at the speed trap, for only the first 5 laps |
| max_speed | The max speed of the car, in km/h, from speeds recorded at the speed trap |
| position | Final position of the driver (starts at 1) |
Research Question: Can we predict the average speed trap speed over all laps in a Grand Prix Race using the speed trap speeds for the practice and qualifying laps?
For the Multiple Linear Regression (MLR), we focused on predicting race speeds using speeds from the practice and qualifying sessions. Specifically, we focused on the speeds at the speed traps, which are specific points on the track where the highest speeds are usually recorded. First, we split the dataset into Practice/Qualifying sessions and computed the average speed trap speed for each. We then matched these with the average speed trap speeds for the Races to see whether the practice and qualifying sessions could be used to predict speeds in the final races. Our original model equation was as follows:
\[ \textbf{Race Speed} = \beta_0 + \beta_1 \text{P}_1 + \beta_2 \text{P}_2 + \beta_3 \text{P}_3 + \beta_4 \text{Q}\]
where \(\text{P}_1\) is the average speed over all laps of Practice 1, \(\text{P}_2\) of Practice 2, \(\text{P}_3\) of Practice 3, and \(\text{Q}\) of the Qualifying laps.
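The full model above is an ordinary least-squares fit; the actual analysis used R's lm(). As a language-neutral illustration, here is a Python sketch on synthetic session speeds (all numbers invented) showing the design matrix and the \(R^2\) computation reported below.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic session speeds (km/h) standing in for the real averages;
# the true data came from the OpenF1 extracts described earlier.
p1 = rng.normal(280, 10, n)
p2 = rng.normal(282, 10, n)
p3 = rng.normal(284, 10, n)
q  = rng.normal(290, 10, n)
race = 60 + 0.6 * p2 + 0.2 * p3 + rng.normal(0, 25, n)

# Design matrix with an intercept column, matching the model equation.
X = np.column_stack([np.ones(n), p1, p2, p3, q])
beta, *_ = np.linalg.lstsq(X, race, rcond=None)

# R^2 = 1 - SSE/SST, the fit statistic reported for both models.
fitted = X @ beta
r2 = 1 - np.sum((race - fitted) ** 2) / np.sum((race - race.mean()) ** 2)
```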
After model selection using stepAIC(), our final model was:
\[\begin{align} \textbf{Race Speed} &= \beta_0 + \beta_1 \text{P}_2 + \beta_2 \text{P}_3 \\ &= 59.0857 + 0.6389\, \text{P}_2 + 0.1901\, \text{P}_3 \end{align}\]
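The stepwise selection was done with R's stepAIC(). As a rough illustration of the idea (not the R implementation), the sketch below does backward elimination by the Gaussian AIC, \(n\log(\mathrm{RSS}/n) + 2k\) up to a constant, on synthetic data where only two of four predictors carry signal.

```python
import numpy as np

def aic(X, y):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(k+1)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n, k = X.shape
    return n * np.log(rss / n) + 2 * (k + 1)

def backward_step(X, y, names):
    """Greedily drop one predictor at a time while doing so lowers AIC."""
    keep = list(range(X.shape[1]))
    best = aic(X[:, keep], y)
    improved = True
    while improved and len(keep) > 1:
        improved = False
        for j in keep[1:]:              # never drop the intercept (col 0)
            trial = [c for c in keep if c != j]
            score = aic(X[:, trial], y)
            if score < best:
                best, keep, improved = score, trial, True
                break
    return [names[c] for c in keep], best

rng = np.random.default_rng(1)
n = 300
p1, p2, p3, q = (rng.normal(0, 1, n) for _ in range(4))
y = 0.6 * p2 + 0.2 * p3 + rng.normal(0, 1, n)   # p1 and q are pure noise
X = np.column_stack([np.ones(n), p1, p2, p3, q])
selected, _ = backward_step(X, y, ["(Intercept)", "P1", "P2", "P3", "Q"])
```

With these invented coefficients, the informative predictors P2 and P3 survive elimination, mirroring the final model's retained terms.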
The adjusted \(R^2\) values for both models are:

| model | Adjusted \(R^2\) |
|---|---|
| Full model | 0.3849 |
| AIC model | 0.3870 |
The adjusted \(R^2\) did increase after model selection, but only by a marginal amount. The model therefore leaves much of the variance in the observed race speeds unexplained: with an adjusted \(R^2\) of \(0.387\), it accounts for only about 38.7% of the total variance in average race speed.
The regression plots with correlations are also shown to the right, with average race speed in km/h as the response variable. As seen in the plots, although there is some correlation between the variables, the variance is too high for the model to fit the data accurately.
The assumptions of the multiple linear regression model were also checked; these plots are included to the right. Based on them, the assumptions of equal residual variance, independence, and normality all appear to hold. We also checked for multicollinearity by examining the correlations between predictors. The correlation plot, also to the right, shows that multicollinearity was not an issue in this model, since all pairwise correlations are under 0.8.
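The multicollinearity screen described above is just a pairwise correlation matrix checked against the 0.8 rule of thumb. A minimal sketch on synthetic predictors (invented data, two moderately related variables plus an independent one):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
# Two moderately related predictors plus one independent one.
a = rng.normal(0, 1, n)
b = 0.5 * a + rng.normal(0, 1, n)
c = rng.normal(0, 1, n)

# Pairwise correlation matrix; the report's rule of thumb flags
# multicollinearity when any off-diagonal entry exceeds 0.8.
R = np.corrcoef(np.column_stack([a, b, c]), rowvar=False)
off_diag = np.abs(R[np.triu_indices_from(R, k=1)])
collinear = bool(np.any(off_diag > 0.8))
```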
Conclusion: Although the model satisfies its assumptions, it is not a good regression model: we cannot accurately predict average race speeds from the average speeds of the practice and qualifying laps.
Call:
lm(formula = race_speed ~ p2_speed + p3_speed, data = mlr_df)
Residuals:
Min 1Q Median 3Q Max
-127.041 -14.345 3.637 17.877 59.784
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.08575 13.23099 4.466 1.04e-05 ***
p2_speed 0.63893 0.07000 9.128 < 2e-16 ***
p3_speed 0.19010 0.06123 3.105 0.00204 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26.01 on 394 degrees of freedom
Multiple R-squared: 0.3901, Adjusted R-squared: 0.387
F-statistic: 126 on 2 and 394 DF, p-value: < 2.2e-16
Research Question: Can we predict the overall average lap duration using the average lap duration times of the track sections?
We decided to use Ridge Regression to predict the overall average lap duration from the average lap times in the different sectors of the track. We chose Ridge Regression for these variables because of multicollinearity: the predictors are highly correlated, as can be seen in the correlation matrix provided to the right.
The coefficients plot shows that larger values of \(\lambda\) shrink the \(\beta\) estimates toward zero. The minimum and optimal \(\lambda\) values are also marked in this plot. The Optimal Lambda plot shows the cross-validation process of picking the \(\lambda\) that minimizes the mean squared error.
The Model Fit summary to the right gives the coefficients of the model fitted by Ridge Regression; the optimal \(\lambda\) value is about 5.16. To assess the goodness of fit, we used the fitted model to obtain predicted values, computed the sum of squared errors from them, and derived the \(R^2\) value given below.
| model | R2 |
|---|---|
| Ridge Regression model | 0.8648704 |
This means that our Ridge Regression model accounts for about 86.5% of the variance in the response variable, the average lap duration.
Conclusion: Thus, we can conclude that this is a valid model to use in predicting the overall average lap duration using average lap duration times for each sector of the track.
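The analysis used R's glmnet() with cross-validated \(\lambda\); as an illustration of the underlying technique only, here is a closed-form ridge sketch in Python (centered predictors, unpenalized intercept, a fixed \(\lambda\), and invented sector times — none of this is the project's actual data or code).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: beta = (Xc'Xc + lam*I)^-1 Xc'yc, with the
    intercept handled by centering so it is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

rng = np.random.default_rng(3)
n = 500
s1 = rng.normal(30, 3, n)                 # sector-1 time (s), synthetic
s2 = 0.8 * s1 + rng.normal(0, 1, n)       # deliberately correlated with s1
s3 = rng.normal(25, 2, n)
y = s1 + s2 + s3 + rng.normal(0, 1, n)    # total lap time plus noise

X = np.column_stack([s1, s2, s3])
b0, coefs = ridge_fit(X, y, lam=5.0)

# Goodness of fit via R^2 = 1 - SSE/SST, as in the report.
pred = b0 + X @ coefs
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```

The \(\lambda I\) term stabilizes the inversion precisely when \(X^\top X\) is near-singular, which is why ridge suits the correlated sector times here.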
The optimal \(\lambda\) value using glmnet() is 5.162448.
The coefficients of the Ridge Regression model using the optimal lambda value are:
| name | val |
|---|---|
| (Intercept) | 74.7236544 |
| avg_duration_sec1 | 0.7981413 |
| avg_duration_sec2 | 0.1900571 |
| avg_duration_start | 0.1726709 |
| avg_duration_sec1_start | -0.1170244 |
| avg_duration_sec2_start | -0.3116496 |
The \(R^2\) value of this model is 0.8648704.
Research Question: Can we predict the average speed recorded at the speed traps using the average speed recorded at the second intermediate point on the track?
We decided to fit a LOESS regression to predict the average speed
recorded at the speed traps over all laps (avg_st_speed)
using the average speed recorded at the second intermediate point on the track
(avg_i2_speed). The fitted LOESS plot is given to the
right. As we can see, there is some variance in this model fit, and it
seems to contain some outliers. In order to assess the goodness of fit
of this model, we calculated the mean squared error value using the
residuals, which was 959.2609. In order to put this value into
perspective, we fit a simple linear regression model to the same values
and calculated the mean squared error of that to see if LOESS performed
any better. The MSE values were as follows:
| method | MSE |
|---|---|
| LOESS | 959.2609 |
| Simple Linear Regression | 1093.7860 |
As we can see, the MSE value is lower for LOESS, so the LOESS model fits the data better. However, is this model even valid? To check, we made a Residuals vs Fitted plot to observe the variance of the residuals, which is given to the right. Although clustered near the right side of the plot, possibly due to outliers, the residuals appear randomly scattered around zero, so the equal variance assumption holds. Next, we checked for normality, which raised some issues. Both the Normal Q-Q plot and the Shapiro-Wilk normality test were used, as seen to the right under the “Normality Assumption” section. The model did not pass the normality test, so it is not valid to draw predictions from. Looking at the trends in the other plots, we believe some outliers may be interfering with the model fit and causing normality to fail, since the Q-Q plot shows only a few points near the origin that skew the normality.
Conclusion: This is not a valid model for predicting the average speed recorded at the speed traps from the average speed at the second intermediate point on the track. To address this, we could use Cook’s distance to identify and remove outliers or high-leverage points, or apply a Box-Cox transformation to address the non-normality, and thereby obtain a more valid model fit.
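The LOESS fit above was produced with R's loess(). To make the technique concrete, here is a hand-rolled local-linear sketch in Python (tricube weights over a span-fraction neighborhood) compared against a straight-line fit, as in the MSE table above — synthetic data, invented span, not the project's code.

```python
import numpy as np

def local_linear(x, y, x0, span=0.3):
    """One LOESS-style prediction at x0: a weighted straight-line fit
    over the nearest span-fraction of the data, with tricube weights."""
    k = max(int(np.ceil(span * len(x))), 2)
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]                      # local neighbourhood
    w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3  # tricube kernel
    Xl = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(Xl.T @ W @ Xl, Xl.T @ W @ y[idx])
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(0, 0.2, 300)   # a clearly non-linear trend

loess_pred = np.array([local_linear(x, y, xi) for xi in x])
loess_mse = float(np.mean((y - loess_pred) ** 2))

# Straight-line baseline, mirroring the report's MSE comparison.
lin_pred = np.polyval(np.polyfit(x, y, 1), x)
lin_mse = float(np.mean((y - lin_pred) ** 2))
```

On curved data the local fits track the trend while the global line cannot, so the LOESS-style MSE comes out lower — the same comparison the table above reports.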
LOESS Fit MSE: 959.2609027
Simple Linear Model Fit MSE: 1093.7857316
Call:
loess(formula = y ~ x, data = data.bind, span = span1, degree = degree,
family = family)
Number of Observations: 2563
Equivalent Number of Parameters: 18.45
Residual Standard Error: 31.13
Trace of smoother matrix: 21.9 (exact)
Control settings:
span : 0.09256002
degree : 1
family : gaussian
surface : interpolate cell = 0.2
normalize: TRUE
parametric: FALSE
drop.square: FALSE
Shapiro-Wilk normality test
data: loessfit$residuals
W = 0.95263, p-value < 2.2e-16
Research Question: Are we able to accurately predict whether a racer will place in the top 10 of a given race/practice run?
\(H_0\): The null model is our best model.
\(H_1\): A k-NN model is a more accurate model to predict the outcome of the racer (top-10).
In the k-Nearest Neighbors model we designed, we decided to include all non-redundant information (i.e., we eliminated numeric codes corresponding to other columns) to give our model the best chance of finding patterns. After encoding both the meeting names and locations as dummy variables, we ran the model and achieved an accuracy above what we had hoped for. Our best result was:
| Accuracy | K-Value |
|---|---|
| 0.7069351 | 1 |
The first plot on the right shows the accuracies for k-values from 1 to 15. There is a clear downward trend, so we did not feel the need to try higher k-values, especially since we were trying to predict “winners” out of groups of 20.
One indicator of why this model may have been successful is the nature of the data. On the right are three scatterplots of the (standardized) times plotted against each other, which show large amounts of grouping. This is ideal for the k-NN algorithm, especially since it can also use the location variables to break the data into even smaller clusters of neighbors. For a better idea of how well our model did, a visualization of the confusion matrix is the last plot on the right.
Since there are 20 positions, and the null model would simply choose yes or no for all racers, our model only had to beat an accuracy of 50%; it achieved about 71%. Thus, we have sufficient statistical evidence to state that a k-NN model is a superior model for predicting whether or not a racer will place in the top 10 of a given race/practice.
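The k-NN procedure above (standardize, dummy-encode, majority vote over the k nearest points) can be sketched in a few lines. This Python version uses two invented clusters standing in for "top 10" vs "not top 10"; the real model was trained on the full standardized F1 features.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """Majority-vote k-NN on Euclidean distance (standardize features
    first, as the report does, so no feature dominates the metric)."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(d)[:k]]
        preds.append(int(np.argmax(np.bincount(nearest))))
    return np.array(preds)

rng = np.random.default_rng(5)
# Two synthetic clusters standing in for "top 10" (1) vs "not" (0):
# lower standardized times for the faster class.
fast = rng.normal(-1.0, 0.5, size=(100, 3))
slow = rng.normal(+1.0, 0.5, size=(100, 3))
X = np.vstack([fast, slow])
y = np.array([1] * 100 + [0] * 100)

# Simple holdout split.
idx = rng.permutation(200)
train, test = idx[:150], idx[150:]
pred = knn_predict(X[train], y[train], X[test], k=5)
accuracy = float(np.mean(pred == y[test]))
```

Well-separated clusters like these are exactly the regime where k-NN shines, which matches the grouping visible in the scatterplots.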
| avg_duration | avg_duration_sec1 | avg_duration_sec2 | avg_duration_sec3 | avg_i1_speed | avg_st_speed |
|---|---|---|---|---|---|
| Min. :-3.46472 | Min. :-2.20142 | Min. :-3.254584 | Min. :-2.261067 | Min. :-2.93446 | Min. :-8.06515 |
| 1st Qu.:-0.78539 | 1st Qu.:-0.95230 | 1st Qu.:-0.576076 | 1st Qu.:-0.532085 | 1st Qu.:-0.77858 | 1st Qu.:-0.50528 |
| Median :-0.05654 | Median : 0.01575 | Median : 0.008087 | Median :-0.162523 | Median : 0.22632 | Median : 0.15729 |
| Mean :-0.10000 | Mean :-0.09031 | Mean : 0.008531 | Mean : 0.003542 | Mean : 0.02105 | Mean : 0.02362 |
| 3rd Qu.: 0.43281 | 3rd Qu.: 0.56529 | 3rd Qu.: 0.722877 | 3rd Qu.: 0.314717 | 3rd Qu.: 0.83606 | 3rd Qu.: 0.75674 |
| Max. : 4.87405 | Max. : 3.58951 | Max. : 2.697585 | Max. : 4.623260 | Max. : 2.09781 | Max. : 1.71540 |
| avg_duration | avg_duration_sec1 | avg_duration_sec2 | avg_duration_sec3 | avg_i1_speed | avg_st_speed |
|---|---|---|---|---|---|
| Min. :-2.76598 | Min. :-1.82850 | Min. :-3.247409 | Min. :-2.080458 | Min. :-3.43133 | Min. :-3.88864 |
| 1st Qu.:-0.57024 | 1st Qu.:-0.85903 | 1st Qu.:-0.643692 | 1st Qu.:-0.566149 | 1st Qu.:-0.72852 | 1st Qu.:-0.54257 |
| Median : 0.24093 | Median : 0.25554 | Median : 0.005930 | Median :-0.192201 | Median : 0.21292 | Median : 0.16065 |
| Mean : 0.09619 | Mean : 0.08687 | Mean :-0.008206 | Mean :-0.003407 | Mean :-0.02025 | Mean :-0.02272 |
| 3rd Qu.: 0.68983 | 3rd Qu.: 0.80170 | 3rd Qu.: 0.751916 | 3rd Qu.: 0.293268 | 3rd Qu.: 0.76496 | 3rd Qu.: 0.67046 |
| Max. : 3.69800 | Max. : 2.78013 | Max. : 2.599732 | Max. : 4.678484 | Max. : 1.89016 | Max. : 1.59535 |
Research Question: Is a Naive Bayes model better at predicting which round of practice a given entry is than the null model?
\(H_0\): The null model is the best model for predicting which round of practice an entry is.
\(H_1\): The Naive Bayes model is better at classifying which round of practice an entry is than the null model.
For the Naive Bayes model, we were interested in whether we could predict which round of practice an entry in the data was. For (almost) every race, there are multiple rounds of practice where the drivers can warm up both the car and themselves before driving. We expected a higher average speed as the drivers got more practice in, so we expected Practice 1 (round 1) to have a lower transformed minimum and maximum than Practice 3 (a log transform was used due to the high spread). However, the summary statistics tables for each round show that while the minimums of each numeric variable tend to increase, the maximums either decrease or stay roughly the same. After seeing this, we decided to also include the meeting names, to help the model learn the distribution of practice rounds at each location. You can see this in the second chart on the right.
Overall, the model did not do very well, which was to be expected. The numerical predictors differed very little between rounds of practice despite their large spread. Including the locations helped, especially for locations with only one round of practice, such as the US Grand Prix and the Qatar Grand Prix, but the model still had limited success. Looking at the confusion matrix (the last chart on the right), we see that the model predicted Practice 2 most of the time and was unable to differentiate Practice 3 from the other two rounds. It had a decent detection rate for Practice 1, but overall the model accuracy was rather low:
| Accuracy |
|---|
| 0.4438903 |
While this is a fairly low accuracy, the most common practice round is Practice 1 at around 40%, which is the best the null model could achieve by always picking one classification. Since our accuracy exceeds that, we have sufficient statistical evidence to reject the null hypothesis and say that the Naive Bayes model is a better model for predicting which practice round a given entry is.
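Gaussian Naive Bayes, as used above, scores each class by prior times per-feature Gaussian likelihoods and picks the highest log-posterior. A self-contained sketch on synthetic data with only slightly shifted class means (mimicking the weak separation between practice rounds described above — all numbers invented):

```python
import numpy as np

def gnb_fit(X, y):
    """Per-class Gaussian parameters and priors for naive Bayes."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        stats[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9,
                    len(Xc) / len(X))
    return stats

def gnb_predict(stats, X):
    """Pick the class with the highest log-posterior, treating
    features as independent Gaussians given the class."""
    preds = []
    for x in X:
        best_c, best_lp = None, -np.inf
        for c, (mu, var, prior) in stats.items():
            lp = (np.log(prior)
                  - 0.5 * np.sum(np.log(2 * np.pi * var))
                  - 0.5 * np.sum((x - mu) ** 2 / var))
            if lp > best_lp:
                best_c, best_lp = c, lp
        preds.append(best_c)
    return np.array(preds)

rng = np.random.default_rng(6)
# Three overlapping "practice rounds" with only slightly shifted means.
X = np.vstack([rng.normal(m, 1.0, size=(120, 4)) for m in (0.0, 0.3, 0.6)])
y = np.repeat([0, 1, 2], 120)

stats = gnb_fit(X, y)
acc = float(np.mean(gnb_predict(stats, y_pred_X := X) == y))
baseline = 1 / 3  # null model: always pick one round
```

Even with heavy class overlap the model beats the always-one-class baseline, which is the same comparison the hypothesis test above relies on.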
| avg_duration | avg_duration_sec1 | avg_duration_sec2 | avg_duration_sec3 | avg_i1_speed | avg_st_speed |
|---|---|---|---|---|---|
| Min. :3.273 | Min. :2.678 | Min. :2.862 | Min. :2.699 | Min. :4.747 | Min. :4.220 |
| 1st Qu.:4.913 | 1st Qu.:3.983 | 1st Qu.:3.521 | 1st Qu.:3.310 | 1st Qu.:5.252 | 1st Qu.:5.464 |
| Median :5.041 | Median :4.366 | Median :3.646 | Median :3.454 | Median :5.435 | Median :5.535 |
| Mean :5.025 | Mean :4.281 | Mean :3.632 | Mean :3.558 | Mean :5.381 | Mean :5.504 |
| 3rd Qu.:5.143 | 3rd Qu.:4.590 | 3rd Qu.:3.850 | 3rd Qu.:3.643 | 3rd Qu.:5.506 | 3rd Qu.:5.600 |
| Max. :6.704 | Max. :6.641 | Max. :4.234 | Max. :5.164 | Max. :5.673 | Max. :5.753 |
| avg_duration | avg_duration_sec1 | avg_duration_sec2 | avg_duration_sec3 | avg_i1_speed | avg_st_speed |
|---|---|---|---|---|---|
| Min. :4.156 | Min. :2.643 | Min. :2.960 | Min. :3.032 | Min. :4.974 | Min. :5.090 |
| 1st Qu.:4.854 | 1st Qu.:4.058 | 1st Qu.:3.504 | 1st Qu.:3.286 | 1st Qu.:5.249 | 1st Qu.:5.483 |
| Median :4.945 | Median :4.267 | Median :3.618 | Median :3.408 | Median :5.400 | Median :5.544 |
| Mean :4.987 | Mean :4.187 | Mean :3.598 | Mean :3.571 | Mean :5.365 | Mean :5.526 |
| 3rd Qu.:5.070 | 3rd Qu.:4.424 | 3rd Qu.:3.808 | 3rd Qu.:3.610 | 3rd Qu.:5.489 | 3rd Qu.:5.605 |
| Max. :6.337 | Max. :5.676 | Max. :3.978 | Max. :5.538 | Max. :5.719 | Max. :5.718 |
| avg_duration | avg_duration_sec1 | avg_duration_sec2 | avg_duration_sec3 | avg_i1_speed | avg_st_speed |
|---|---|---|---|---|---|
| Min. :4.036 | Min. :2.734 | Min. :2.932 | Min. :3.075 | Min. :4.851 | Min. :4.926 |
| 1st Qu.:4.961 | 1st Qu.:4.185 | 1st Qu.:3.510 | 1st Qu.:3.338 | 1st Qu.:5.255 | 1st Qu.:5.448 |
| Median :5.098 | Median :4.533 | Median :3.651 | Median :3.483 | Median :5.417 | Median :5.534 |
| Mean :5.104 | Mean :4.390 | Mean :3.629 | Mean :3.598 | Mean :5.376 | Mean :5.507 |
| 3rd Qu.:5.240 | 3rd Qu.:4.791 | 3rd Qu.:3.835 | 3rd Qu.:3.689 | 3rd Qu.:5.501 | 3rd Qu.:5.598 |
| Max. :6.013 | Max. :5.894 | Max. :4.050 | Max. :4.979 | Max. :5.705 | Max. :5.744 |
| Practice 1 | Practice 2 | Practice 3 | Predicted Class | True Class |
|---|---|---|---|---|
| 0.1432 | 0.3111 | 0.5457 | Practice 3 | Practice 2 |
| 0.4642 | 0.0133 | 0.5225 | Practice 3 | Practice 2 |
| 0.1632 | 0.2309 | 0.6058 | Practice 3 | Practice 2 |
| 0.1758 | 0.1984 | 0.6257 | Practice 3 | Practice 3 |
| 0.1126 | 0.5630 | 0.3244 | Practice 2 | Practice 3 |
| 0.1211 | 0.4071 | 0.4718 | Practice 3 | Practice 3 |
| 0.1293 | 0.3463 | 0.5244 | Practice 3 | Practice 3 |
| 0.1323 | 0.3746 | 0.4931 | Practice 3 | Practice 1 |
| 0.2524 | 0.0897 | 0.6580 | Practice 3 | Practice 1 |
| 0.2214 | 0.1247 | 0.6539 | Practice 3 | Practice 1 |
Research Question: Is a Logistic Regression model better than the null model at predicting whether an entry is a race or a practice?
\(H_0\): No, the null model is the better model for predicting whether an entry is a race or practice.
\(H_1\): Yes, the Logistic Regression model is the better model for predicting whether an entry is a race or practice.
As specified above, here we are interested in classifying the entries as Race or Practice. Overall there are three race types: Race, Sprint, and Sprint Shootout, and four practice types: Practice 1, Practice 2, Practice 3, and Qualifying. We originally used the full model with all available variables, but since Logistic Regression is fit with the glm() (generalized linear model) function, we were able to use stepwise regression to narrow down the variables. The remaining variables included all of the numerical variables, as well as the year and the meeting place name.
On the right, we show a breakdown of Race vs. Practice percentages in the data, as well as graphs of the numerical data colored by class. The last chart on the right is the confusion matrix, which shows that the model had very high detection of the “Race” class, but less than half of the “Practice” class was predicted correctly. Overall, the accuracy of the model was:
| Accuracy |
|---|
| 0.9140625 |
This accuracy looks much better than that of the Naive Bayes model; however, the null model could achieve roughly 70% accuracy simply by choosing “Race” every time, so the Logistic Regression model's real gain over the null is about 20 percentage points. Looking at the scatterplots, it would also be worth attempting k-NN classification here, since there are similar groupings. Nonetheless, the Logistic Regression model is more accurate than the null, so we have sufficient statistical evidence to say the Logistic Regression model is the better model.
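The binary classifier above was fit with R's glm(); the same model can be illustrated with a plain gradient-descent logistic regression in Python. The class imbalance (~70% "Race") and the shifted feature means are invented to echo the situation described above — this is a sketch, not the project's code.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression (the logit-link glm
    in R); returns weights including an intercept term."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))          # predicted P(Race)
        w -= lr * Xb.T @ (p - y) / len(y)      # mean log-loss gradient
    return w

def predict_logistic(w, X, threshold=0.5):
    Xb = np.column_stack([np.ones(len(X)), X])
    return (1 / (1 + np.exp(-Xb @ w)) >= threshold).astype(int)

rng = np.random.default_rng(7)
# "Race" entries (label 1) are ~70% of the data, matching the class
# imbalance noted above, at shifted standardized speeds.
n_race, n_prac = 280, 120
X = np.vstack([rng.normal(0.5, 1.0, size=(n_race, 3)),
               rng.normal(-1.0, 1.0, size=(n_prac, 3))])
y = np.concatenate([np.ones(n_race, int), np.zeros(n_prac, int)])

w = fit_logistic(X, y)
acc = float(np.mean(predict_logistic(w, X) == y))
null_acc = n_race / (n_race + n_prac)  # always predict "Race"
```

Comparing `acc` against `null_acc` is the same accuracy-vs-null comparison made in the conclusion above.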
| avg_duration | avg_duration_sec1 | avg_duration_sec2 | avg_duration_sec3 | avg_i1_speed | avg_st_speed |
|---|---|---|---|---|---|
| Min. :-4.72438 | Min. :-2.1773 | Min. :-5.204720 | Min. :-2.0222 | Min. :-3.85844 | Min. :-9.37728 |
| 1st Qu.:-0.05902 | 1st Qu.:-0.1115 | 1st Qu.:-0.438566 | 1st Qu.:-0.5047 | 1st Qu.:-0.68549 | 1st Qu.:-0.56746 |
| Median : 0.33213 | Median : 0.4375 | Median : 0.067485 | Median :-0.1098 | Median : 0.28031 | Median : 0.03809 |
| Mean : 0.34474 | Mean : 0.3200 | Mean : 0.003027 | Mean : 0.1556 | Mean : 0.04054 | Mean :-0.17133 |
| 3rd Qu.: 0.72146 | 3rd Qu.: 0.8590 | 3rd Qu.: 0.776976 | 3rd Qu.: 0.3771 | 3rd Qu.: 0.77206 | 3rd Qu.: 0.49404 |
| Max. : 5.09201 | Max. : 3.8076 | Max. : 2.381576 | Max. : 5.1511 | Max. : 1.93802 | Max. : 1.57330 |
| avg_duration | avg_duration_sec1 | avg_duration_sec2 | avg_duration_sec3 | avg_i1_speed | avg_st_speed |
|---|---|---|---|---|---|
| Min. :-3.5507 | Min. :-2.0480 | Min. :-2.799308 | Min. :-2.064175 | Min. :-3.31636 | Min. :-3.45631 |
| 1st Qu.:-1.3935 | 1st Qu.:-1.1736 | 1st Qu.:-0.599634 | 1st Qu.:-0.817456 | 1st Qu.:-0.95683 | 1st Qu.:-0.05357 |
| Median :-0.9971 | Median :-0.9676 | Median : 0.129379 | Median :-0.454300 | Median :-0.08776 | Median : 0.61466 |
| Mean :-0.8754 | Mean :-0.8125 | Mean :-0.007686 | Mean :-0.395036 | Mean :-0.10295 | Mean : 0.43507 |
| 3rd Qu.:-0.6438 | 3rd Qu.:-0.6712 | 3rd Qu.: 0.604307 | 3rd Qu.: 0.008255 | 3rd Qu.: 0.83364 | 3rd Qu.: 1.15002 |
| Max. : 2.7823 | Max. : 2.4860 | Max. : 2.324628 | Max. : 2.914064 | Max. : 2.15583 | Max. : 1.75966 |